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1 Introduction 

Smoothing Spline ANOVA (SS-ANOVA) models in reproducing kernel Hilbert spaces (RKHS) 
provide a very general framework for data analysis, modeling and learning in a variety of fields. 
Discrete, noisy scattered, direct and indirect observations can be accommodated with multiple 
inputs and multiple possibly correlated outputs and a variety of meaningful structures. The purpose 
of this paper is to give a brief overview of the approach and describe and contrast a series of 
applications, while noting some recent results. 

2 The general SS-ANOVA model 

The SS-ANOVA model with Gaussian data has the form 

Vi = f(h{i),--- ,t d (i)) +ei, i = l,---,n, (1) 

where e = (ei, ■■■,e n )' ~ V(0, (T^/ nxn ), t a (E 1~( a \ where / 7~( Q ) is a measurable space, o. — 
l,--- ,d;(h,--- ,t d ) = teT = TW ® ••• <g> TW, and cr 2 may be unknown. For / satisfying some 
measurability conditions a unique ANOVA decomposition of / of the form 

f(t U ■ ■ ■ , t d ) = H + fa(ta) + E fafiM + ' ' " (2) 

can always be defined as follows: Let djj, a be a probability measure on and define the averaging 
operator £ a on T by 

(£ a f)(t) = [ f(t 1 ,---,t d )d f x a (t a ). (3) 
Then the identity is decomposed as 

/=n(£a+(i-£a))=n^+E( 7 -^) n ^ w 

a a a f3^ a 

+ j2(i-s a )(i-Si3) n £y+---+n( j -^)- (5) 

The components of this decomposition generate the ANOVA decomposition of / of the form (2) 

b y C = (Ua £ cx)f,fa = {{I ~ £a)Uf3jta S l3)fJaf3 = {(I ~ £a){I ~ £/?) Ily^ £ l)f, and SO forth - 



Further details in the RKHS context may be found in Wahba (1990)Gu & Wahba (1993)Wahba, 
Wang, Gu, Klein k Klein (1995) 

The idea behind SS-ANOVA is to construct an RKHS 7i of functions on T so that the com- 
ponents of the SS-ANOVA decomposition represent an orthogonal decomposition of / in H. Then 
RKHS methods can be used to explicitly impose smoothness penalties of the form J2 a ^aJa(fa) + 
J2a/3 ^apJapifap) + " " "j where, however, the series will be truncated at some point. This is done 
as follows: Let be an RKHS of functions on 7» with j T(a) f a (t a )dfi a = for f a (t a ) € H {a \ 
and let [1^] be the one dimensional space of constant functions on . Construct H as 

w = n({[i<°>]} ©{«<">}) (6) 
= [i] e Y n {a) © Y i n(a) ® h(/3) ] © • • • > ( ? ) 

j a</3 

where [1] denotes the constant functions on T. With some abuse of notation, factors of the form 
[l( Q )] are omitted whenever they multiply a term of a different form. Thus is a shorthand for 
<g> • • • © [l^- 1 )] (8>H (q) © © • • • © [lW] (which is a subspace of W). The components of the 

AN OVA decomposition are now in mutually orthogonal subspaces of TL. Note that the components 
will depend on the measures d[x a and these should be chosen in a specific application so that the 
fitted mean, main effects, two factor interactions, etc. have reasonable interpretations. 

Next, is decomposed into a parametric part and a smooth part, by letting = © 
Tii a \ where is finite dimensional (the "parametric" part) and Hs°^ (the "smooth" part) is 
the orthocomplement of fiffi in Elements of are not penalized through the device of 

letting J a {fa) = \\Ps a ^ fa\\ 2 where P^ is the orthogonal projector onto 7ii a \ [H^ is now 

a direct sum of four orthogonal subspaces: [H^ © H^} = [H^ © H^] © [H^ © H^] © [H^ © 
7~t^] © [Hs°^ © Tii^]. By convention the elements of the finite dimensional space [H^QH^] will 
not be penalized. Continuing this way results in an orthogonal decomposition of H into sums of 
products of unpenalized finite dimensional subspaces, plus main effects 'smooth' subspaces, plus 
two factor interaction spaces of the form parametric © smooth [TC^ ®nf£^\, smooth © parametric 
[Ti-i^ © H^] and smooth © smooth [Hs°^ © Tii^] and similarly for the three and higher factor 
subspaces. 

Now suppose that we have selected the model Ai, that is, we have decided which subspaces 
will be included. Collect all of the included unpenalized subspaces into a subspace, call it 7i°, of 
dimension M, and relabel the other subspaces as W 3 , (3 = 1, 2, ■ • • ,p. Hr may stand for a subspace 
, or one of the three subspaces in the decomposition of [H^ <S>H^] which contains at least one 
'smooth' component, or, a higher order subspace with at least one 'smooth' component. Collecting 
these subspaces as M = H° © Y^p ^ \ t ne estimation problem in the Gaussian case becomes: Find 
/ in M = H° © E/3 to minimize 

i n V 

Y(m- um)) 2 + ^Y^w^n 2 ' (8) 

U i=l 0=1 

where P 13 is the orthogonal projector in M onto H 13 , and choose the (overparameterized) tuning 
parameters A, 9p. Bayesian confidence intervals, with the so-called 'across the function' property, 
are available for these models. 
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The residual sum of squares (RSS) in (8) is replaced by the log likelihood 



n 



C(yj) = -Y,[mf(m)-Hf(t)))} 



(9) 



1=1 



for data from exponential families. Some of the examples below will involve Bernoulli (0, 1) data, 
in which case b(f) = log(l + e*). Software for computing and tuning SS-ANOVA models may be 
found in the codes GRKPACK, RKPACK and gss and elsewhere, links to these and other spline 
related codes can be found via 

http://www.stat.wisc.edu/~wahba goto "SOFTWARE". Tuning methods are discussed in the 
first talk in this session. RSS may be replaced by robust functionals, or any convex functionals 
satisfying some mild conditions insuring uniqueness, and, in recent work on classification by support 
vector machines, RSS is replaced by so-called hinge functions. 

3 Applications in Environmental Data 

Gu & Wahba (1993) considered data from the Eastern Lake Survey of 1984 which gave water 
acidity measurements and geographic locations, and other measurements of lakes in the Blue Ridge 
Mountains area. Of interest is the pH as it depends on the geographic location and calcium 
concentration in the lakes. Model diagnostics were proposed there, and the model 



was chosen, where t\ is calcium content and t 2 is the pair (latitude, longitude). The thin plate 
spline penalty was imposed on the spatial variable. The calcium content and geography main effects 
models were plotted, and it can be seen that geography is a near proxy for elevation along the Blue 
Ridge mountains. 

4 Risk factor estimation 

Wahba et al. (1995) considered the risk of progression of diabetic retinopathy in a subpopulation 
of the Wisconsin Epidemiological Study of Diabetic Retinopathy, whose baseline retinopathy score 
was below (i. e. good) a prespecified level. The observations were yi = 1 if the ith person's 
retinopathy progressed at the first followup, and if it had not. Here / is the log odds ratio, 
/ = log\p/(l — p)\. Three important variables were identified by informal means (see Section 9) 
and were t\ = duration of diabetes, t 2 = glycosylated hemoglobin, and t% = body mass index, and 
was modeled as 



An interesting scientific result was found, that, persons in the study group with the longest duration 
of diabetes were at a lower risk, possibly because they had survived longest without exceeding the 
prespecified threshold. 

5 Time and Space Models on the Globe 

In Wahba & Luo (1997)Luo, Wahba & Johnson (1997) thirty years (1961-90) of Dec. Jan. Feb. 
average temperature measurements at 1000 stations around the globe (with missing data) was 



Vi = h(h(i)) + / 2 (t2(*)) + fo(h(i)Mi)) + e * 



(10) 



f(t) = n + h(t a ) + a 2 t 2 + / 3 (t 3 ) + fi3(h,t 3 ). 



(11) 
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analyzed for spatial trends, as well as a global trend. Here t = (ii,t2) = ( X >P) where x is 
year, and P is (latitude, longitude). The RKHS of historical global temperature functions is 
H = [[lWle^ewFl^dl^lewf], a collection of functions f(x,P), on {1, 2, 30} <g><S, where 
S is the sphere, and TL and / have corresponding decompositions given below: 



n = 


[i] 


e [(/)} e 




e 




e 


M ® M 2) ] e 
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Here <^ is a linear function which averages to 0. A sum of squares of second differences was applied 
to the time variable, and a spline on the sphere penalty (Wahba (1981)Wahba (1982)) was applied 
to the space variable. For a cross country skiier in the midwest, as this author is, the results 
were very disappointing, in that they clearly showed a warming trend stretching from the midwest 
towards Alaska (trend by space term) which was stronger than the global mean trend. 

6 Multiple correlated Bernoulli outcomes 

Gao, Wahba, Klein & Klein (2001) were motivated by a demographic study involving a population 
with a variety of observed risk factors for several particular eye diseases, the outcomes were the 
incidence of one or more of several diseases or conditions in either or both of two eyes. Outcomes 
of the two eyes in a particular subject are presumed to be correlated, and incidences of the various 
outcomes may also be correlated. The amount of correlation may be of particular interest. The risk 
factors could be person specific or eye-specific. The "two-eye" methods are a special case of what 
might be called "k-eye" methods where one person (unit) has several component outcomes which 
might have correlated outcomes, depending on unit-specific and component specific risk factors. 

The general log-linear model for multivariate Bernoulli data goes as follows: Assuming there 
are J different endpoints, and Kj repeated measurements for the j'th endpoint, let Yjk denote 
the fcth measurement of the jth endpoint. For example, in ophthalmological studies, we have 
two repeated measurements for each disease: left eye and right eye. In a typical longitudinal 
study, we have repeated measurements over the time. Y = (Yjk,j = l,...,J,k = l,...,Kj) is a 
multivariate Bernoulli outcome variable. Let Xjk = (Xjki, Xj^, ...,XjkD) be a vector of predictor 
variables ranging over the subset X of 7Z D , where Xj^ denotes the dth predictor variable for 
the kth measurement of the jth endpoint. Some predictor variables may take different values for 
different measurements while others may be the same for all Y^'s. For example, in ophthalmology 
studies, there may be present both person-specific predictors and eye-specific predictors. The 
person-specific predictors are the same for each person. For the eye-specific predictors, the set 
of predictor variables is the same, but they may take different values for the left and right eyes. 
We can treat observations from both eyes as correlated repeated measurements in our model. Let 
X = (Xjk,j = l,...,J,k = l,...,Kj). Then (X, Y) is a pair of random vectors. For a response 
vector y = (yjk,j = l,-.-,J,k = l,...,Kj), its joint probability distribution conditioning on the 
predictor variables X can be written as 

P(Y = y\X) = 
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J Kj J 

ex p{J2J2fjkyjk + Yl X! ^i^^i^ 

j=lk=l j=lfei<fe2 

+ a hki,j2k2Vjik l yj 2 k2 + ••• + ®U,12,...,JKjyiiyi2--yjKj 

j\<32 ki,k 2 

-Kf,a)}, 

where 

Kf,a)= (12) 

} + ... 

Let M = X)/=i be the length of the vector F. There are in total 2 M — 1 parameters: (/, a) = 
(/n, /12, fjKj, an, 12) ■••) 12,..., JKj), which may depend on X. The parameter space is un- 
constrained. They have straightforward interpretations in terms of conditional probabilities. For 
example, 

f jk = logit{P{Y ]k = = 0, X)) (13) 

is the conditional logit function; 

a jlkl , hk2 = logOR(Y jlkl ,Y j2k2 \Y^>-^ = 0,X) (14) 

is the conditional log odds ratio, which is a meaningful way to measure pairwise association; inter- 
pretations of other terms are given in the paper. 

n independent observations (xi,yi),i = l,...,n, are given, where yi = (ym, yu2, yuKj) and 
Xi = (xiniXiw^-iXijKj). Here y ijk and x ijk = (x ijkl , x ijk2 , x ijkD ) are the outcome variable 
and predictor vector for the A;th measurement of the jth endpoint of the ith subject. Let fj k (i) 
be the conditional logit function for the kth measurement of the jth endpoint of the ith subject. 
There is little reason to believe the fj k will take different functional forms for the same endpoint. 
Hence we can assume fij k = fj(xij k ). The same reasoning applies to the association terms. The fj k 
were modeled via SS-ANOVA in the paper, and a leaving-out-one-person based generalized cross 
validation for the smoothing parameters was obtained. 



7 Multichotomous responses 

Lin (1998) considered multichotomous outcomes, the data is (yi, t(i)) where yi is coded to show that 
the i subject, with attribute vector t(i) is in one of k + 1 categories, k > 1. Let Pj(t),j = 0, 1, ■ ■ ■ , k 
be the probability that a subject with attribute vector t is in category k, Y^j=oPj{^) = 1- Let 
P(t) = log\ Pj (t)/p (t)],j = 1, Then 

nit) = 1 + zU^ ' J = 1 ""' k <15) 
Mt) = ^ (16> 
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The class label for the ith subject is coded as j/j = (yn, • ■ ■ , yik) where yij = 1 if the ith subject is 
in class j and otherwise. Letting / = (/*, • • • , f k ) the negative log likelihood can be written as 

n k k 

Av, /) = Ei- E vaf'lM) + ME i + e fHu) )}. (17) 

i=i j=i j=i 

/ J = Y^?j=i4>v + ^ where the h? can have an ANOVA decomposition. Then A||/i||^ K in (8) is 
replaced by 

k 

E E *j«JjM) + X! ^apJjctfVip) + ■■■■ (is) 

j=l « a</3 

Ten year mortality data of a group of n = 646 subjects with the risk factors age (x\), glycosylated 
hemoglobin (2:2) and systolic blood pressure (2:3) were (among other things) recorded at baseline 
and they were divided into four categories with respect to their status after ten years, as =alive, 
1 = died of diabetes, 2 =died of heart disease, and 3 =died of other causes. Each of the f J , j = 1, 2, 3 
was modeled as P(x 1 ,x 2 ,x 3 ) = H 3 + f{{xi) + f 2 {x 2 ) + f 3 z {x 3 ) + f 2Z {x 2 ,x 3 ). ThepjJ = 0, • • • , 3 were 
estimated by minimizing I(y, f) = (17) + (18) and the multiple smoothing parameters estimated 
by a generalized cross validation method for polychotomous data given in Lin (1998). The plots 
graphically convey the suggestion that the younger deaths are disproportionately diabetic, thus 
quickly raising further questions to confront the data base. 

8 The multicategory support vector machine 

The multicategory support vector machine (MSV) proposed in Lee, Lin & Wahba (2002), Lee, 
Lin & Wahba (2001) considers the case where each subject is in one of k categories labeled as 
j = 1, • • ■ , k, as in the preceeding section, except for notational convenience there are k instead 
of k + 1 categories. The support vector machine is an efficient method for classification - it is 
not estimating the probability of membership in a particular category as before, but its target is 
an indicator as to which category as subject is in (or most likely to be in) (see Lin (2002). The 
class label j/j is now coded as a A; dimensional vector with 1 in the jth position if example i is 
in category j and — otherwise. For example yi = (1, — ^rry , • • • , — tzj) indicates that the ith 
example is in category 1. We define a /c-tuple of separating functions f(t) = (/ 1 (t), • • • f k (t)), with 
each = d? + b? with h? G Hk, and which will be required to satisfy a sum-to-zero constraint, 
Ylj=i /■'(*) = 0' f° r an * m T ■ Note that, unlike the estimate of Section 7, all categories are treated 
symmetrically. 

Let Lj r = l,r / j, Ljj = 0, j, r = 1, • • • , k. Let cat(yi) = j if yi is from category j. Then, if 
is from category j, L cat t y .\ r = if r = j and 1 otherwise. Then the MSVM is defined as the vector 
of functions f\ = (f\,---,fx), with each h k in Hr satisfying the sum-to-zero constraint, which 
minimizes 

\ E E L cat{yi)r {f{U) - y ir ) + + A ]T \\V\\ 2 Hk . (19) 

i=l r=l j=l 

Generalizations of the penalty term are possible, if necessary. It can be shown that the k = 2 case 
reduces to the usual 2-category SVM just discussed, and it is shown in Lee et al. (2001) that the 
target for the MSVM is f(t) = (f l (t),---,f k (t)) with P(t) = 1 if pj(t) is bigger than the other 
Pl(t) and P(t) = otherwise. See also Wahba (2002). 
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9 Summary 



The SS-ANOVA models have proved to be useful in a variety of modeling situations, only a few 
described here. In each case a tuning method which governs the bias-variance tradeoff must be 
employed, and, for very large sample sizes, efficient approximate methods need to be devised. 
Model selection, that is, the determination of which variables and/or terms to include in the 
model is an important issue. Zhang, Wahba, Lin, Voelker, Ferris, Klein & Klein (2001)Zhang, 
Wahba, Lin, Voelker, Ferris, Klein & Klein (2002) have recently proposed likelihood basis pursuit, 
a nonparametric form of the LASSO, for the model selection problem associated with SS-ANOVA. 
Although a number of tuning methods for the various situations have been proposed, along with 
numerical methods for large data sets, a variety of problems remain to be investigated, including 
optimum nonlinear transformations of the variables, efficient computational methods, methods for 
covariates not missing at random, and public software for very large sample sizes and for some of 
the more complex structures. 
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